Method and kit for discriminating between breast cancer and benign breast disease

ABSTRACT

A method and kit for discriminating between breast cancer and benign breast disease by the determination of the expression level of at least one target gene having a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6 to obtain an expression profile for the patient, and the comparison of the expression profile of the patient with expression profiles of target genes from patients previously clinically classified as breast cancer and expression profiles of target genes from patients previously clinically classified as benign breast disease.

This is a divisional of application Ser. No. 13/696,937 filed Nov. 28, 2012, which is a National Stage Application of PCT/CN2010/073342 filed May 28, 2010. The entire disclosures of the prior applications are hereby incorporated by reference herein in their entirety.

FIELD OF THE INVENTION

The present invention relates to the field of discrimination between breast cancer and benign breast disease. In particular, the present invention relates to a method and kit for discriminating between breast cancer and benign breast disease.

BACKGROUND

Breast cancer is the most common cancer in women worldwide. Because the pathogenesis of breast cancer is inadequately understood, early diagnosis is of great significance. Currently, mammographic screening is the most frequently used method for breast cancer detection. Several large randomized trials have shown that it can reduce breast cancer morbidity by 20 to 40 percent in women aged 40 to 69. Mammography is currently the gold standard for early breast cancer detection, but its reported overall sensitivity is significantly reduced in certain subsets of women, particularly women with radiographically dense breasts and those at increased risk of breast cancer. Estimates of film mammographic sensitivity in women with extremely dense breasts range from 48 to 63%. Mammography has the disadvantages of low sensitivity and specificity, especially in younger women, and of painful compression during the procedure. In addition, because of small breast volume and high breast density, many patients fail to obtain a clear mammographic result at screening and are often classified as BI-RADS 0 (BI-RADS: Breast Imaging Reporting and Data System) in their mammographic diagnosis.

The BI-RADS was developed in 1993 by the American College of Radiology (ACR) to standardize mammographic reporting, to improve communication, to reduce confusion regarding mammographic findings, to aid research, and to facilitate outcomes monitoring. According to the Mammography Quality Standards Act (MQSA) of 1997 [Final Rule 62(208): 55988], all mammograms in the United States must be reported using one of these assessment categories. Each mammographic study should be assigned a single assessment based on the most concerning findings. Classifications are divided into an incomplete assessment (category 0) and completed assessments (categories 1, 2, 3, 4, 5, 6). BI-RADS Category 0 is defined as an incomplete assessment, which means that additional imaging is needed. Follow-up is usually recommended, which involves a long, expensive and anxiety-producing process based on ultrasonography, magnetic resonance imaging (MRI) or even biopsy. Ultrasonography, even combined with mammography, is associated with a high rate of false positive results, which leads to unnecessary invasive procedures. The long waiting time for an MRI appointment is detrimental to patients. MRI also has a high rate of false positive results, together with a high cost. Given these factors, there is a strong need for a new, easy-to-use test that would improve breast cancer detection and indicate the risk for patients, particularly when the mammogram cannot be interpreted.

Serum biomarkers, such as CEA and CA15-3, do not perform well in cancer screening [1]. Recently, some literature has described the possibility of early diagnosis of breast cancer using gene-expression patterns in peripheral blood cells [2]. The results of these pilot studies indicate that cancer causes characteristic changes in the biochemical environment of blood and that, as a result, the expression pattern of certain identified genes can be used to discriminate between cancer and control groups with high accuracy. However, no alternative based on blood biomarkers has yet succeeded in discriminating, within BI-RADS 0 patients, between breast cancer (BC) and benign breast disease (BBD).

SUMMARY OF THE INVENTION

The present invention provides a method for discriminating between breast cancer and benign breast disease in a biological sample from a patient, wherein it comprises the following steps: a) obtaining the biological sample comprising a biological material from the patient, b) contacting the biological material from the biological sample with at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 44, wherein the at least one reagent is specific for at least a target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6, and c) determining the expression level of at least one target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6 to obtain an expression profile for the patient, and d) performing analysis of the expression profile of the patient with expression profiles of target genes from patients previously clinically classified as breast cancer and expression profiles of target genes from patients previously clinically classified as benign breast disease, wherein: if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as breast cancer, then the patient is prognosticated to have breast cancer, and if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as benign breast disease, then the patient is prognosticated to have a benign breast disease.

In one embodiment, in step b) the biological material is brought into contact with reagents specific for a combination of at least 4 and no more than 28 target genes, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively, and the expression level of at least said 4 genes is determined in step c) to obtain the expression profile for the patient.

In another embodiment, in step b) the biological material is brought into contact with reagents specific for a combination of 28 genes, wherein the reagents include reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1 to 44 respectively, and the expression level of the 28 genes is determined in step c) to obtain the expression profile for the patient.

Particularly, the biological sample taken from the patient is a blood sample. More particularly, the biological material comprises nucleic acids.

In one embodiment, the at least one specific reagent of step b) comprises at least one hybridization probe. In another embodiment, the specific reagents of step b) comprise at least one hybridization probe and at least one primer. In a further embodiment, the specific reagents of step b) comprise one hybridization probe and two primers.

The present invention also provides a kit for discriminating breast cancer from benign breast disease in a biological sample from a patient, the kit comprising at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 44, wherein the at least one reagent is specific for at least a target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6.

In one embodiment, the kit of the present invention comprises reagents specific for a combination of at least 4 and no more than 28 target genes, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively.

In another embodiment, the kit of the present invention comprises reagents specific for a combination of 28 target genes, wherein the reagents include reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1 to 44.

The present invention also relates to the use of at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 44 in the manufacture of a composition for discriminating breast cancer from benign breast disease in a biological sample from a patient, wherein the at least one reagent is specific for at least a target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6.

In one embodiment, the present invention relates to use of reagents specific for a combination of at least 4 and no more than 28 target genes in the manufacture of a composition for discriminating breast cancer from benign breast disease in a biological sample from a patient, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively.

In another embodiment, the present invention relates to use of reagents specific for a combination of 28 target genes in the manufacture of a composition for discriminating breast cancer from benign breast disease in a biological sample from a patient, wherein the reagents include reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1 to 44.

DETAILED DESCRIPTION OF THE INVENTION

The present invention proposes to solve all the drawbacks of the prior art by providing a diagnostic tool for discriminating, within BI-RADS 0 patients, between BC and BBD. Considering that most patients whose mammography is classified as BI-RADS 0 have a breast lesion, the present study aims to discriminate BC from BBD. This is very different from earlier research, which focused on the expression patterns of breast cancer patients and of patients with no signs of the disease. This approach excludes from the detection of cancer some factors that are not cancer-specific, such as regulation of the inflammatory response.

Surprisingly, the inventors have demonstrated that the analysis of the expression of at least one target gene selected from CHI3L1, CLEC4C, LILRA3 and TUBB2A gives information that is sufficient for distinguishing BBD patients from BC patients. Of course, the analysis of the expression of the above target genes, taken in combination, improves the sensitivity and the specificity of the result, as does the analysis of the expression profile of the 28 target genes described below in Table 1, including CHI3L1, CLEC4C, LILRA3 and TUBB2A.

TABLE 1

| SEQ ID NO: | Abbreviated name | Name of gene | Accession number |
|---|---|---|---|
| 1 | CHI3L1 | Chitinase 3-like 1 (cartilage glycoprotein-39) | ENST00000255409 |
| 2 | CLEC4C | C-type lectin domain family 4, member C | ENST00000354629 |
| 3 | | | ENST00000360345 |
| 4 | LILRA3 | Leukocyte immunoglobulin-like receptor, subfamily A (without TM domain), member 3 | ENST00000251390 |
| 5 | TUBB2A | Tubulin, beta 2A | ENST00000259218 |
| 6 | | | ENST00000333628 |
| 7 | ADAM12 | ADAM metallopeptidase domain 12 | ENST00000368676 |
| 8 | CHURC1 | Churchill domain containing 1 | ENST00000359118 |
| 9 | RNF182 | Ring finger protein 182 | ENST00000313403 |
| 10 | TMEM176B | Transmembrane protein 176B | ENST00000326442 |
| 11 | | | ENST00000429904 |
| 12 | | | ENST00000434545 |
| 13 | | | ENST00000447204 |
| 14 | FAM118A | Family with sequence similarity 118, member A | ENST00000216214 |
| 15 | | | ENST00000441876 |
| 16 | ANKRD20A | Ankyrin repeat domain 20 family, member A1/2/3/4/5 | ENST00000377477 |
| 17 | KLRC1/2 | Killer cell lectin-like receptor subfamily C, member 1/2 | ENST00000347831 |
| 18 | | | ENST00000359151 |
| 19 | | | ENST00000381902 |
| 20 | KIAA1671 | KIAA1671 protein | ENST00000358431 |
| 21 | ZBTB44 | Zinc finger and BTB domain containing 44 | ENST00000454539 |
| 22 | LQK1 | LQK1 hypothetical protein short isoform | NR_027285 |
| 23 | | | NR_027286 |
| 24 | APOBEC3A | Apolipoprotein B mRNA editing enzyme, catalytic polypeptide-like 3A | ENST00000249116 |
| 25 | | | ENST00000402255 |
| 26 | LOC283788 | Homo sapiens cDNA FLJ90087 fis, clone HEMBA1005230, weakly similar to zinc protein 140 | NR_027436 |
| 27 | FAM87A/B | Family with sequence similarity 87, member A/B | ENST00000330148 |
| 28 | LOC642236 | Similar to FRG1 protein (FSHD region gene 1 protein) | ENST00000226798 |
| 29 | C4A/B | Complement component 4A/B | ENST00000428596 |
| 30 | ENTPD5 | Ectonucleoside triphosphate diphosphohydrolase 5 | ENST00000334696 |
| 31 | LOC728263 | Similar to hCG1818012 | NG_008780 |
| 32 | MGC15705 | Putative uncharacterized protein MGC15705 | ENST00000425084 |
| 33 | FAM160A1 | Family with sequence similarity 160 A1 | ENST00000340515 |
| 34 | | | ENST00000435205 |
| 35 | PLXDC1 | Plexin domain containing 1 | ENST00000315392 |
| 36 | SFN | Stratifin | ENST00000339276 |
| 37 | CLU | Clusterin | ENST00000316403 |
| 38 | | | ENST00000380446 |
| 39 | | | ENST00000405140 |
| 40 | PSPH | Phosphoserine phosphatase | ENST00000275605 |
| 41 | | | ENST00000395471 |
| 42 | | | ENST00000437355 |
| 43 | HLA-DQB1 | Major Histocompatibility Complex, class II, DQB1 | ENST00000399084 |
| 44 | | | ENST00000434651 |

Several variants sometimes exist for the same target gene, as revealed, for example, in table 1. In the present invention, all the variants are relevant and are indifferently analyzed. It is clearly understood that, if various isoforms of these genes exist, all the isoforms are relevant for the present invention.
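By way of illustration only, the relationship between the 28 target genes and the 44 sequences of Table 1 can be sketched as a simple lookup structure. The grouping of variant sequences under a gene is inferred from the rows of Table 1 that carry no gene name, and the names `TARGET_GENES` and `gene_for_seq_id` are hypothetical, introduced only for this sketch.

```python
# Abbreviated gene name -> SEQ ID NOs of its variants, transcribed from Table 1.
# Assumption: a row without a gene name is a further variant of the gene above it.
TARGET_GENES = {
    "CHI3L1": [1],
    "CLEC4C": [2, 3],
    "LILRA3": [4],
    "TUBB2A": [5, 6],
    "ADAM12": [7],
    "CHURC1": [8],
    "RNF182": [9],
    "TMEM176B": [10, 11, 12, 13],
    "FAM118A": [14, 15],
    "ANKRD20A": [16],
    "KLRC1/2": [17, 18, 19],
    "KIAA1671": [20],
    "ZBTB44": [21],
    "LQK1": [22, 23],
    "APOBEC3A": [24, 25],
    "LOC283788": [26],
    "FAM87A/B": [27],
    "LOC642236": [28],
    "C4A/B": [29],
    "ENTPD5": [30],
    "LOC728263": [31],
    "MGC15705": [32],
    "FAM160A1": [33, 34],
    "PLXDC1": [35],
    "SFN": [36],
    "CLU": [37, 38, 39],
    "PSPH": [40, 41, 42],
    "HLA-DQB1": [43, 44],
}


def gene_for_seq_id(seq_id: int) -> str:
    """Return the abbreviated gene name whose variants include the given SEQ ID NO."""
    for gene, seq_ids in TARGET_GENES.items():
        if seq_id in seq_ids:
            return gene
    raise KeyError(f"SEQ ID NO: {seq_id} is not listed in Table 1")
```

With this structure, any variant resolves to the same gene: `gene_for_seq_id(2)` and `gene_for_seq_id(3)` both return `"CLEC4C"`, which reflects the statement that all variants of a target gene are treated alike.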

The inventors have identified peripheral blood mRNA signatures which can help to discriminate breast cancer from benign breast disease, with a particular interest in patients with non-conclusive mammography.

Accordingly the present invention relates to a method for discriminating between breast cancer and benign breast disease in a biological sample from a patient, wherein it comprises the following steps:

-   a) obtaining the biological sample comprising a biological material from the patient,
-   b) contacting the biological material from the biological sample with at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 44, wherein the at least one reagent is specific for at least a target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6, and
-   c) determining the expression level of at least one target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6 to obtain an expression profile for the patient, and
-   d) performing analysis of the expression profile of the patient with expression profiles of target genes from patients previously clinically classified as breast cancer and expression profiles of target genes from patients previously clinically classified as benign breast disease, wherein: if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as breast cancer, then the patient is prognosticated to have breast cancer, and if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as benign breast disease, then the patient is prognosticated to have a benign breast disease.
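By way of illustration only, the comparison of step d) can be sketched in Python as follows. The description does not specify a clustering algorithm; the nearest-centroid rule with Euclidean distance used here is an assumption chosen for simplicity. The function names and the reference expression values are hypothetical and are not taken from any patient data.

```python
from math import dist
from statistics import fmean

# Hypothetical expression levels for four target genes (arbitrary units),
# one list per previously classified patient. Not real data.
BC_REFERENCE = [[8.0, 5.0, 6.5, 7.0], [7.6, 5.4, 6.1, 7.4]]
BBD_REFERENCE = [[5.0, 7.0, 8.0, 5.5], [5.4, 6.6, 8.4, 5.1]]


def centroid(profiles):
    """Mean expression level per gene across equal-length profiles."""
    return [fmean(levels) for levels in zip(*profiles)]


def classify(patient, bc_profiles, bbd_profiles):
    """Assign the patient profile to the group with the nearer centroid.

    Assumption: "clustered with" is approximated by Euclidean distance
    to each group's mean profile. A tie is reported as benign.
    """
    distance_to_bc = dist(patient, centroid(bc_profiles))
    distance_to_bbd = dist(patient, centroid(bbd_profiles))
    if distance_to_bc < distance_to_bbd:
        return "breast cancer"
    return "benign breast disease"
```

A profile such as `[7.9, 5.1, 6.4, 7.1]` lies close to the hypothetical breast cancer centroid and is assigned to that group. In practice the reference sets, the number of genes and the clustering method would be chosen by those skilled in the art.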

In one or more embodiments it is possible in step b) to bring the biological material into contact with reagents specific for a combination of at least 2, or at least 3 or at least 4 target genes and no more than 28 target genes, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequence set forth in any one of SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively, and the expression level of at least 2, 3 or 4 genes is determined in step c).

Examples of combination of target genes are described below:

-   SEQ ID NO: 1 and SEQ ID NO: 2 or 3
-   SEQ ID NO: 1 and SEQ ID NO: 4
-   SEQ ID NO: 1 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 2 or 3 and SEQ ID NO: 4
-   SEQ ID NO: 2 or 3 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 4 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 1, SEQ ID NO: 2 or 3 and SEQ ID NO: 4
-   SEQ ID NO: 1, SEQ ID NO: 2 or 3 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 1, SEQ ID NO: 4 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 2 or 3, SEQ ID NO: 4 and SEQ ID NO: 5 or 6
-   SEQ ID NO: 4, SEQ ID NO: 5 or 6 and SEQ ID NO: 2 or 3, and
-   SEQ ID NO: 1, SEQ ID NO: 2 or 3, SEQ ID NO: 4 and SEQ ID NO: 5 or 6;

the following combinations of target genes being preferred: SEQ ID NO: 1, SEQ ID NO: 2, SEQ ID NO: 4 and SEQ ID NO: 5; and SEQ ID NO: 1, SEQ ID NO: 3, SEQ ID NO: 4 and SEQ ID NO: 6.
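By way of illustration only, the combinations of two, three or four of the core target genes can be enumerated programmatically. The names `CORE_TARGETS` and `core_combinations` are hypothetical and serve only this sketch; order within a combination is ignored.

```python
from itertools import combinations

# The four core targets, written as they appear in the combinations above.
CORE_TARGETS = [
    "SEQ ID NO: 1",
    "SEQ ID NO: 2 or 3",
    "SEQ ID NO: 4",
    "SEQ ID NO: 5 or 6",
]


def core_combinations():
    """All unordered combinations of 2, 3 or 4 of the core targets."""
    return [
        combo
        for size in (2, 3, 4)
        for combo in combinations(CORE_TARGETS, size)
    ]
```

The enumeration yields six pairs, four triplets and one quadruplet, that is eleven distinct combinations.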

Consequently, in one embodiment of the method of the present invention in step b) the biological material is brought into contact with reagents specific for a combination of at least 4 and no more than 28 target genes, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively, and the expression level of at least said 4 genes is determined in step c) to obtain the expression profile for the patient.

In another embodiment of the method in step b) the biological material is brought into contact with reagents specific for a combination of 28 genes, wherein the reagents include reagents specific for the target genes comprising the nucleic acid sequence set forth in SEQ ID NOs 1 to 44 respectively, and the expression level of the 28 genes is determined in step c) to obtain the expression profile for the patient.

The biological sample taken from the patient is any sample liable to contain a biological material as defined hereinafter, in particular a blood, plasma, serum, tissue or circulating-cell sample, a blood sample being preferred. This biological sample is provided by any type of sampling known to those skilled in the art.

In an embodiment of the method of the invention, the biological material can be extracted from the biological sample by any of the nucleic acid extraction and purification protocols well known to those skilled in the art. In another embodiment of the present invention the target biological material is not extracted from the biological sample and its analysis is directly performed from the sample.

The term “biological material” is intended to mean any material that makes it possible to detect the expression of a target gene. The biological material may in particular comprise proteins, or nucleic acids, such as, in particular, deoxyribonucleic acids (DNA) or ribonucleic acids (RNA). The nucleic acid may in particular be an RNA (ribonucleic acid).

According to a preferred embodiment of the invention, the biological material is extracted and comprises nucleic acids, preferably RNAs, and even more preferably total RNA. Total RNA comprises transfer RNAs (tRNA), messenger RNAs (mRNAs), such as the mRNAs transcribed from the target gene, but also transcribed from any other gene, and ribosomal RNAs. This biological material comprises material specific for a target gene, such as in particular the mRNAs transcribed from the target gene or the proteins derived from these mRNAs.

By way of indication, the nucleic acid extraction can be carried out by:

-   a step consisting of lysis of the cells present in the biological sample, in order to release the nucleic acids contained in the cells of the patient. By way of example, use may be made of the methods of lysis as described in patent applications: WO 00/05338 regarding mixed magnetic and mechanical lysis, WO 99/53304 regarding electrical lysis, WO 99/15321 regarding mechanical lysis. Those skilled in the art may use other well-known methods of lysis, such as thermal or osmotic shocks or chemical lyses using chaotropic agents such as guanidinium salts (U.S. Pat. No. 5,234,809);
-   a purification step, for separating the nucleic acids from the other cellular constituents released in the lysis step. This generally makes it possible to concentrate the nucleic acids, and can be adapted to the purification of DNA or of RNA. By way of example, use may be made of magnetic particles optionally coated with oligonucleotides, by adsorption or covalence (in this respect, see U.S. Pat. No. 4,672,040 and U.S. Pat. No. 5,750,338), and the nucleic acids which are bound to these magnetic particles can thus be purified by means of a washing step. This nucleic acid purification step is particularly advantageous if it is desired to subsequently amplify said nucleic acids. A particularly advantageous embodiment of these magnetic particles is described in patent applications: WO-A-97/45202 and WO-A-99/35500.

The term “specific reagent” is intended to mean a reagent which, when it is brought into contact with biological material as defined above, binds with the material specific for said target gene. By way of indication, when the specific reagent and the biological material are of nucleic origin, bringing the specific reagent into contact with the biological material allows the specific reagent to hybridize with the material specific for the target gene. The term “hybridization” is intended to mean the process during which, under appropriate conditions, two nucleotide fragments bind with stable and specific hydrogen bonds so as to form a double-stranded complex. These hydrogen bonds form between the complementary adenine (A) and thymine (T) (or uracil (U)) bases (this is referred to as an A-T bond) or between the complementary guanine (G) and cytosine (C) bases (this is referred to as a G-C bond). The hybridization of two nucleotide fragments may be complete (reference is then made to complementary nucleotide fragments or sequences), i.e. the double-stranded complex obtained during this hybridization comprises only A-T bonds and C-G bonds. This hybridization may be partial (reference is then made to sufficiently complementary nucleotide fragments or sequences), i.e. the double-stranded complex obtained comprises A-T bonds and C-G bonds that make it possible to form the double-stranded complex, but also bases not bound to a complementary base. The hybridization between two nucleotide fragments depends on the working conditions that are used, and in particular on the stringency. The stringency is defined in particular as a function of the base composition of the two nucleotide fragments, and also by the degree of mismatching between two nucleotide fragments.
The stringency can also depend on the reaction parameters, such as the concentration and the type of ionic species present in the hybridization solution, the nature and the concentration of denaturing agents and/or the hybridization temperature. All these data are well known and the appropriate conditions can be determined by those skilled in the art. In general, depending on the length of the nucleotide fragments that it is intended to hybridize, the hybridization temperature is between approximately 20 and 70° C., in particular between 35 and 65° C. in a saline solution at a concentration of approximately 0.5 to 1 M. A sequence, or nucleotide fragment, or oligonucleotide, or polynucleotide, is a series of nucleotide motifs assembled together by phosphoric ester bonds, characterized by the informational sequence of the natural nucleic acids, capable of hybridizing to a nucleotide fragment, it being possible for the series to contain monomers having different structures and to be obtained from a natural nucleic acid molecule and/or by genetic recombination and/or by chemical synthesis. A motif is a derivative of a monomer which may be a natural nucleotide of nucleic acid, the constitutive elements of which are a sugar, a phosphate group and a nitrogenous base; in DNA, the sugar is deoxy-2-ribose, in RNA, the sugar is ribose; depending on whether DNA or RNA is involved, the nitrogenous base is selected from adenine, guanine, uracil, cytosine and thymine; alternatively the monomer is a nucleotide that is modified in at least one of the three constitutive elements; by way of example, the modification may occur either at the level of the bases, with modified bases such as inosine, methyl-5-deoxycytidine, deoxyuridine, dimethylamino-5-deoxyuridine, diamino-2,6-purine, bromo-5-deoxyuridine or any other modified base capable of hybridization, or at the level of the sugar, for example the replacement of at least one deoxyribose with a polyamide (P. E. Nielsen et al, Science, 254, 1497-1500 (1991)[3]), or else at the level of the phosphate group, for example its replacement with esters in particular selected from diphosphates, alkyl- and arylphosphonates and phosphorothioates.
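By way of illustration only, the distinction drawn above between complete and partial hybridization can be sketched as a base-pairing check on two strands of equal length aligned antiparallel. The function names are hypothetical. The sketch only counts A-T (or A-U) and G-C pairs; whether a partially paired duplex actually forms depends on the stringency conditions, which the code does not model.

```python
# Base pairs recognised in the description: A-T (or A-U) and G-C.
_PAIRS = ({"A", "T"}, {"A", "U"}, {"G", "C"})


def is_pair(base_a: str, base_b: str) -> bool:
    """True if the two bases can form an A-T, A-U or G-C bond."""
    return {base_a, base_b} in _PAIRS


def paired_fraction(strand_a: str, strand_b: str) -> float:
    """Fraction of positions that pair when strand_b is aligned antiparallel.

    Both strands are written 5' to 3' and must have the same, non-zero length.
    """
    if not strand_a or len(strand_a) != len(strand_b):
        raise ValueError("strands must be non-empty and of equal length")
    aligned = zip(strand_a.upper(), reversed(strand_b.upper()))
    return sum(is_pair(a, b) for a, b in aligned) / len(strand_a)


def is_fully_complementary(strand_a: str, strand_b: str) -> bool:
    """Complete hybridization: every position forms an A-T/A-U or G-C bond."""
    return paired_fraction(strand_a, strand_b) == 1.0
```

For example, the hypothetical fragments `ATGC` and `GCAT` pair at every position, whereas `ATGC` and `GCAA` pair at three positions out of four and would correspond to a partial hybridization at best.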

According to a specific embodiment of the invention, the specific reagent comprises at least one hybridization probe or at least one hybridization probe and at least one primer which is specific for the target gene or at least one hybridization probe and two primers specific for the target genes.

For the purpose of the present invention, the term “amplification primer” is intended to mean a nucleotide fragment comprising from 5 to 100 nucleotides, preferably from 15 to 30 nucleotides, that allows the initiation of an enzymatic polymerization, for instance an enzymatic amplification reaction. The term “enzymatic amplification reaction” is intended to mean a process which generates multiple copies of a nucleotide fragment through the action of at least one enzyme. Such amplification reactions are well known to those skilled in the art and mention may in particular be made of the following techniques: PCR (polymerase chain reaction), as described in U.S. Pat. No. 4,683,195, U.S. Pat. No. 4,683,202 and U.S. Pat. No. 4,800,159, LCR (ligase chain reaction), disclosed, for example, in patent application EP 0 201 184, RCR (repair chain reaction), described in patent application WO 90/01069, 3SR (self sustained sequence replication) with patent application WO 90/06995, NASBA (nucleic acid sequence-based amplification) with patent application WO 91/02818, TMA (transcription mediated amplification) with U.S. Pat. No. 5,399,491 and RT-PCR. When the enzymatic amplification is a PCR, the specific reagent comprises at least two amplification primers, specific for a target gene, that allow the amplification of the material specific for the target gene. The material specific for the target gene then preferably comprises a complementary DNA obtained by reverse transcription of messenger RNA derived from the target gene (reference is then made to target-gene-specific cDNA) or a complementary RNA obtained by transcription of the cDNAs specific for a target gene (reference is then made to target-gene-specific cRNA). When the enzymatic amplification is a PCR carried out after a reverse transcription reaction, reference is made to RT-PCR.
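By way of illustration only, the length ranges given above for an amplification primer (5 to 100 nucleotides, preferably 15 to 30) can be expressed as a simple check. The function name and the returned labels are hypothetical; the check considers length only and says nothing about the specificity of a primer for its target gene.

```python
def primer_length_class(primer: str) -> str:
    """Classify a primer sequence by length against the ranges defined above."""
    length = len(primer)
    if not 5 <= length <= 100:
        return "outside the defined range"
    if 15 <= length <= 30:
        return "preferred"
    return "permitted"
```

A 20-nucleotide fragment therefore falls in the preferred range, a 50-nucleotide fragment is permitted but not preferred, and a 4-nucleotide fragment is outside the definition.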

The term “hybridization probe” is intended to mean a nucleotide fragment comprising at least 5 nucleotides, such as from 5 to 100 nucleotides, in particular from 10 to 75 nucleotides, such as 15-35 nucleotides and 60-70 nucleotides, having a hybridization specificity under given conditions so as to form a hybridization complex with the material specific for a target gene. In the present invention, the material specific for the target gene may be a nucleotide sequence included in a messenger RNA derived from the target gene (reference is then made to target-gene-specific mRNA), a nucleotide sequence included in a complementary DNA obtained by reverse transcription of said messenger RNA (reference is then made to target-gene-specific cDNA), or else a nucleotide sequence included in a complementary RNA obtained by transcription of said cDNA as described above (reference will then be made to target-gene-specific cRNA). The hybridization probe may include a label for its detection. The term “detection” is intended to mean either a direct detection such as a counting method, or an indirect detection by a method of detection using a label. Many methods of detection exist for detecting nucleic acids (see, for example, Kricka et al., Clinical Chemistry, 1999, no 45 (4), p. 453-458 [4] or Keller G. H. et al., DNA Probes, 2nd Ed., Stockton Press, 1993, sections 5 and 6, p. 173-249 [5]). The term “label” is intended to mean a tracer capable of generating a signal that can be detected. 
A non-limiting list of these tracers includes enzymes which produce a signal that can be detected, for example, by colorimetry, fluorescence or luminescence, such as horseradish peroxidase, alkaline phosphatase, beta-galactosidase, glucose-6-phosphate dehydrogenase; chromophores such as fluorescent, luminescent or dye compounds; electron dense groups detectable by electron microscopy or by virtue of their electrical properties such as conductivity, by amperometry or voltammetry methods, or by impedance measurement; groups that can be detected by optical methods such as diffraction, surface plasmon resonance, or contact angle variation, or by physical methods such as atomic force spectroscopy, tunnel effect, etc.; radioactive molecules such as ³²P, ³⁵S or ¹²⁵I.

For the purpose of the present invention, the hybridization probe may be a “detection” probe. In this case, the “detection” probe is labeled by means of a label. The detection probe may in particular be a “molecular beacon” detection probe as described by Tyagi & Kramer (Nature biotech, 1996, 14:303-308 [6]). These “molecular beacons” become fluorescent during the hybridization. They have a stem-loop-type structure and contain a fluorophore and a “quencher” group. The binding of the specific loop sequence with its complementary target nucleic acid sequence causes the stem to unroll and the emission of a fluorescent signal during excitation at the appropriate wavelength. The detection probe in particular may be a “reporter probe” comprising a “color-coded barcode” according to NanoString™'s technology.

For the detection of the hybridization reaction, use may be made of target sequences that have been labeled, directly (in particular by the incorporation of a label within the target sequence) or indirectly (in particular using a detection probe as defined above). It is in particular possible to carry out, before the hybridization step, a step consisting in labeling and/or cleaving the target sequence, for example using a labeled deoxy-ribonucleotide triphosphate during the enzymatic amplification reaction. The cleavage may be carried out in particular by the action of imidazole or of manganese chloride. The target sequence may also be labeled after the amplification step, for example by hybridizing a detection probe according to the sandwich hybridization technique described in document WO 91/19812. Another specific preferred method of labeling nucleic acids is described in application FR 2780059.

According to a preferred embodiment of the invention, the detection probe comprises a fluorophore and a quencher.

According to an even more preferred embodiment of the invention, the hybridization probe comprises an FAM (6-carboxy-fluorescein) or ROX (6-carboxy-X-rhodamine) fluorophore at its 5′ end and a quencher (Dabsyl) at its 3′ end.

The hybridization probe may also be a “capture” probe. In this case, the “capture” probe is immobilized or can be immobilized on a solid substrate by any appropriate means, i.e. directly or indirectly, for example by covalence or adsorption. As solid substrate, use may be made of synthetic materials or natural materials, optionally chemically modified, in particular polysaccharides such as cellulose-based materials, for example paper, cellulose derivatives such as cellulose acetate and nitrocellulose or dextran, polymers, copolymers, in particular based on styrene-type monomers, natural fibers such as cotton, and synthetic fibers such as nylon; inorganic materials such as silica, quartz, glasses or ceramics; latices; magnetic particles; metal derivatives, gels, etc. The solid substrate may be in the form of a microtitration plate, of a membrane as described in application WO-A-94/12670 or of a particle. It is also possible to immobilize on the substrate several different capture probes, each being specific for a target gene. In particular, a biochip on which a large number of probes can be immobilized may be used as substrate. The term “biochip” is intended to mean a solid substrate that is small in size, to which a multitude of capture probes are attached at predetermined positions. The biochip, or DNA chip, concept dates from the beginning of the 1990s. It is based on a multidisciplinary technology that integrates microelectronics, nucleic acid chemistry, image analysis and information technology. The operating principle is based on a foundation of molecular biology: the hybridization phenomenon, i.e. the pairing, by complementarity, of the bases of two DNA and/or RNA sequences. The biochip method is based on the use of capture probes attached to a solid substrate, on which probes a sample of target nucleotide fragments directly or indirectly labeled with fluorochromes is made to act. 
The capture probes are positioned specifically on the substrate or chip and each hybridization gives a specific piece of information, in relation to the target nucleotide fragment. The pieces of information obtained are cumulative, and make it possible, for example, to quantify the level of expression of one or more target genes. In order to analyze the expression of a target gene, a substrate comprising a multitude of probes, which correspond to all or part of the target gene, which is transcribed to mRNA, can then be prepared. For the purpose of the present invention, the term “low-density substrate” is intended to mean a substrate comprising fewer than 50 probes. For the purpose of the present invention, the term “medium-density substrate” is intended to mean a substrate comprising from 50 probes to 10 000 probes. For the purpose of the present invention, the term “high-density substrate” is intended to mean a substrate comprising more than 10 000 probes.
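The three density classes defined above can be expressed as a short lookup. The following is only a minimal sketch: the function name `substrate_density` is hypothetical and is not part of the described method; the thresholds are those stated in the text.

```python
def substrate_density(n_probes: int) -> str:
    """Classify a substrate by probe count, using the thresholds defined above."""
    if n_probes < 0:
        raise ValueError("probe count cannot be negative")
    if n_probes < 50:
        return "low-density"      # fewer than 50 probes
    if n_probes <= 10_000:
        return "medium-density"   # from 50 to 10 000 probes
    return "high-density"         # more than 10 000 probes
```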

The cDNAs or cRNAs specific for a target gene that it is desired to analyze are then hybridized, for example, to specific capture probes. After hybridization, the substrate or chip is washed and the labeled cDNA or cRNA/capture probe complexes are revealed by means of a high-affinity ligand bound, for example, to a fluorochrome-type label. The fluorescence is read, for example, with a scanner and the analysis of the fluorescence is processed by information technology. By way of indication, mention may be made of the DNA chips developed by the company Affymetrix (“Accessing Genetic Information with High-Density DNA arrays”, M. Chee et al., Science, 1996, 274, 610-614 [7]. “Light-generated oligonucleotide arrays for rapid DNA sequence analysis”, A. Caviani Pease et al., Proc. Natl. Acad. Sci. USA, 1994, 91, 5022-5026 [8]), for molecular diagnoses. In this technology, the capture probes are generally small in size, around 25 nucleotides. Other examples of biochips are given in the publications by G. Ramsay, Nature Biotechnology, 1998, No. 16, p. 40-44 [9]; F. Ginot, Human Mutation, 1997, No. 10, p. 1-10 [10]; J. Cheng et al, Molecular diagnosis, 1996, No. 1 (3), p. 183-200 [11]; T. Livache et al, Nucleic Acids Research, 1994, No. 22 (15), p. 2915-2921 [12]; J. Cheng et al, Nature Biotechnology, 1998, No. 16, p. 541-546 [13] or in U.S. Pat. No. 4,981,783, U.S. Pat. No. 5,700,637, U.S. Pat. No. 5,445,934, U.S. Pat. No. 5,744,305 and U.S. Pat. No. 5,807,522. The main characteristic of the solid substrate should be to conserve the hybridization characteristics of the capture probes on the target nucleotide fragments while at the same time generating a minimum background noise for the method of detection. Three main types of fabrication can be distinguished for immobilizing the probes on the substrate.

The first technique consists in depositing pre-synthesized probes. The attachment of the probes is carried out by direct transfer, by means of micropipettes or of microdots or by means of an inkjet device. This technique allows the attachment of probes having a size ranging from a few bases (5 to 10) up to relatively large sizes of 60 bases (printing) to a few hundred bases (microdeposition).

Printing is an adaptation of the method used by inkjet printers. It is based on the propulsion of very small spheres of fluid (volume <1 nl) at a rate that may reach 4000 drops/second. The printing does not involve any contact between the system releasing the fluid and the surface on which it is deposited.

Microdeposition consists in attaching long probes of a few tens to several hundred bases to the surface of a glass slide. These probes are generally extracted from databases and are in the form of amplified and purified products. This technique makes it possible to produce chips called microarrays that carry approximately ten thousand spots, called recognition zones, of DNA on a surface area of a little less than 4 cm². The use of nylon membranes, referred to as “macroarrays”, which carry products that have been amplified, generally by PCR, with a diameter of 0.5 to 1 mm and the maximum density of which is 25 spots/cm², should not however be forgotten. This very flexible technique is used by many laboratories. In the present invention, the latter technique is considered to be included among biochips. A certain volume of sample can, however, be deposited at the bottom of a microtitration plate, in each well, as in the case in patent applications WO-A-00/71750 and FR 00/14896, or a certain number of drops that are separate from one another can be deposited at the bottom of one and the same Petri dish, according to another patent application, FR 00/14691.

The second technique for attaching the probes to the substrate or chip is called in situ synthesis. This technique results in the production of short probes directly at the surface of the chip. It is based on in situ oligonucleotide synthesis (see, in particular, patent applications WO 89/10977 and WO 90/03382) and is based on the oligonucleotide synthesizer process. It consists in moving a reaction chamber, in which the oligonucleotide extension reaction takes place, along the glass surface.

Finally, the third technique is called photolithography, which is a process that is responsible for the biochips developed by Affymetrix. It is also an in situ synthesis. Photolithography is derived from microprocessor techniques. The surface of the chip is modified by the attachment of photolabile chemical groups that can be light-activated. Once illuminated, these groups are capable of reacting with the 3′ end of an oligonucleotide. By protecting this surface with masks of defined shapes, it is possible to selectively illuminate and therefore activate areas of the chip where it is desired to attach one or other of the four nucleotides. The successive use of different masks makes it possible to alternate cycles of protection/reaction and therefore to produce the oligonucleotide probes on spots of approximately a few tens of square micrometers (μm²). This resolution makes it possible to create up to several hundred thousand spots on a surface area of a few square centimeters (cm²). Photolithography has the advantage of massive parallelism: it makes it possible to create a chip of N-mers in only 4×N cycles. All these techniques can be used with the present invention. According to a preferred embodiment of the invention, the at least one specific reagent of step b) defined above comprises at least one hybridization probe which is preferably immobilized on a substrate. This substrate is preferably a low-, high- or medium-density substrate as defined above.
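The 4×N cycle count can be illustrated with a small simulation. This is a simplified sketch under stated assumptions, not a model of any actual instrument: the function `synthesize` and its data layout are hypothetical, each cycle is reduced to "one mask, one nucleotide", and chemistry details such as deprotection are ignored.

```python
from itertools import product

BASES = "ACGT"

def synthesize(probes: list[str]) -> tuple[list[str], int]:
    """Grow every probe in parallel, one masked coupling cycle at a time.

    probes[i] is the sequence wanted at spot i; all probes share one length N.
    Returns the sequences actually built and the number of cycles used.
    """
    n = len(probes[0])
    if any(len(p) != n for p in probes):
        raise ValueError("all probes must have the same length")
    grown = [""] * len(probes)
    cycles = 0
    for position in range(n):
        for base in BASES:
            # The mask exposes only the spots whose next nucleotide is `base`.
            mask = [i for i, p in enumerate(probes) if p[position] == base]
            for i in mask:
                grown[i] += base
            cycles += 1  # one cycle per (position, base) pair
    return grown, cycles

# All 64 possible 3-mers are built in 4 x 3 = 12 cycles.
all_3mers = ["".join(p) for p in product(BASES, repeat=3)]
grown, cycles = synthesize(all_3mers)
```

The cycle count depends only on the probe length N, not on the number of spots, which is why the number of probes per chip can be very large.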

These hybridization steps on a substrate comprising a multitude of probes may be preceded by an enzymatic amplification reaction step, as defined above, in order to increase the amount of target genetic material.

In step c), the determination of the expression level of a target gene can be carried out by any of the protocols known to those skilled in the art. In general, the expression of a target gene can be analyzed by detecting the mRNAs (messenger RNAs) that are transcribed from the target gene at a given moment or by detecting the proteins derived from these mRNAs.

The invention preferably relates to the determination of the expression level of a target gene by detection of the mRNAs derived from this target gene according to any of the protocols well known to those skilled in the art. According to a specific embodiment of the invention, the expression level of several target genes is determined simultaneously, by detection of several different mRNAs, each mRNA being derived from a target gene.

When the specific reagent comprises at least one amplification primer, it is possible to determine the expression level of the target gene in the following way: 1) After having extracted, as biological material, the total RNA (comprising the transfer RNAs (tRNAs), the ribosomal RNAs (rRNAs) and the messenger RNAs (mRNAs)) from a biological sample as presented above, a reverse transcription step is carried out in order to obtain the complementary DNAs (or cDNAs) of said mRNAs. By way of indication, this reverse transcription reaction can be carried out using a reverse transcriptase enzyme which makes it possible to obtain, from an RNA fragment, a complementary DNA fragment. The reverse transcriptase enzyme from AMV (Avian Myeloblastosis Virus) or from MMLV (Moloney Murine Leukaemia Virus) can in particular be used. When it is more particularly desired to obtain only the cDNAs of the mRNAs, this reverse transcription step is carried out in the presence of nucleotide fragments comprising only thymine bases (polyT), which hybridize by complementarity to the polyA sequence of the mRNAs so as to form a polyT-polyA complex which then serves as a starting point for the reverse transcription reaction carried out by the reverse transcriptase enzyme. cDNAs complementary to the mRNAs derived from a target gene (target-gene-specific cDNA) and cDNAs complementary to the mRNAs derived from genes other than the target gene (cDNAs not specific for the target gene) are then obtained. 2) The amplification primer(s) specific for a target gene is (are) brought into contact with the target-gene-specific cDNAs and the cDNAs not specific for the target gene. The amplification primer(s) specific for a target gene hybridize(s) with the target-gene-specific cDNAs and a predetermined region, of known length, of the cDNAs originating from the mRNAs derived from the target gene is specifically amplified. 
The cDNAs not specific for the target gene are not amplified, whereas a large amount of target-gene-specific cDNAs is then obtained. For the purpose of the present invention, reference is made, without distinction, to “target-gene-specific cDNAs” or to “cDNAs originating from the mRNAs derived from the target gene”. This step can be carried out in particular by means of a PCR-type amplification reaction or by any other amplification technique as defined above. By PCR, it is also possible to simultaneously amplify several different cDNAs, each one being specific for different target genes, by using several pairs of different amplification primers, each one being specific for a target gene: reference is then made to multiplex amplification. 3) The expression of the target gene is determined by detecting and quantifying the target-gene-specific cDNAs obtained in step 2) above. This detection can be carried out after electrophoretic migration of the target-gene-specific cDNAs according to their size. The gel and the medium for the migration can include ethidium bromide so as to allow direct detection of the target-gene-specific cDNAs when the gel is placed, after a given migration period, on a UV (ultraviolet)-ray light table, through the emission of a light signal. The greater the amount of target-gene-specific cDNAs, the brighter this light signal. These electrophoresis techniques are well known to those skilled in the art. The target-gene-specific cDNAs can also be detected and quantified using a quantification range obtained by means of an amplification reaction carried out until saturation. In order to take into account the variability in enzymatic efficiency that may be observed during the various steps (reverse transcription, PCR, etc.), the expression of a target gene of various groups of patients can be normalized by simultaneously determining the expression of a “housekeeping” gene, the expression of which is similar in the various groups of patients. 
By calculating the ratio of the expression of the target gene to the expression of the housekeeping gene, i.e. the ratio of the amount of target-gene-specific cDNAs to the amount of housekeeping-gene-specific cDNAs, any variability between the various experiments is corrected. Those skilled in the art may refer in particular to the following publications: Bustin S A, J Mol Endocrinol, 2002, 29: 23-39; Giulietti A, Methods, 2001, 25: 386-401.
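The housekeeping-gene normalization described above amounts to a simple ratio. The sketch below is illustrative only: the function name and the cDNA quantities are hypothetical, and it assumes that both quantities were measured in the same run so that they share the same enzymatic efficiency.

```python
def normalized_expression(target_cdna: float, housekeeping_cdna: float) -> float:
    """Return the target-gene signal relative to the housekeeping-gene signal."""
    if housekeeping_cdna <= 0:
        raise ValueError("housekeeping signal must be positive")
    return target_cdna / housekeeping_cdna

# Hypothetical quantities: run B is twice as efficient as run A overall,
# so both the target and the housekeeping signals are doubled.
run_a = normalized_expression(target_cdna=120.0, housekeeping_cdna=60.0)
run_b = normalized_expression(target_cdna=240.0, housekeeping_cdna=120.0)
```

Both runs give the same normalized value, which is the sense in which the ratio corrects for variability between experiments.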

When the specific reagent comprises at least one hybridization probe, the expression of a target gene can be determined in the following way: 1) After having extracted, as biological material, the total RNA from a biological sample as presented above, a reverse transcription step is carried out as described above in order to obtain cDNAs complementary to the mRNAs derived from a target gene (target-gene-specific cDNA) and cDNAs complementary to the mRNAs derived from genes other than the target gene (cDNA not specific for the target gene). 2) All the cDNAs are brought into contact with a substrate, on which are immobilized capture probes specific for the target gene whose expression it is desired to analyze, in order to carry out a hybridization reaction between the target-gene-specific cDNAs and the capture probes, the cDNAs not specific for the target gene not hybridizing to the capture probes. The hybridization reaction can be carried out on a solid substrate which includes all the materials as indicated above. According to a preferred embodiment, the hybridization probe is immobilized on a substrate. Preferably, the substrate is a low-, high- or medium-density substrate as defined above. The hybridization reaction may be preceded by a step consisting of enzymatic amplification of the target-gene-specific cDNAs as described above, so as to obtain a large amount of target-gene-specific cDNAs and to increase the probability of a target-gene-specific cDNA hybridizing to a capture probe specific for the target gene. The hybridization reaction may also be preceded by a step consisting in labeling and/or cleaving the target-gene-specific cDNAs as described above, for example using a labeled deoxyribonucleotide triphosphate for the amplification reaction. The cleavage can be carried out in particular by the action of imidazole and manganese chloride. 
The target-gene-specific cDNA can also be labeled after the amplification step, for example by hybridizing a labeled probe according to the sandwich hybridization technique described in document WO-A-91/19812. Other preferred specific methods for labeling and/or cleaving nucleic acids are described in applications WO 99/65926, WO 01/44507, WO 01/44506, WO 02/090584, WO 02/090319. 3) A step consisting of detection of the hybridization reaction is subsequently carried out. The detection can be carried out by bringing the substrate on which the capture probes specific for the target gene are hybridized with the target-gene-specific cDNAs into contact with a “detection” probe labeled with a label, and detecting the signal emitted by the label. When the target-gene-specific cDNA has been labeled beforehand with a label, the signal emitted by the label is detected directly.

When the at least one specific reagent brought into contact in step b) comprises at least one hybridization probe, the expression of a target gene can also be determined in the following way: 1) After having extracted, as biological material, the total RNA from a biological sample as presented above, a reverse transcription step is carried out as described above in order to obtain the cDNAs of the mRNAs of the biological material. The polymerization of the complementary RNA of the cDNA is subsequently carried out using a T7 polymerase enzyme which functions under the control of a promoter and which makes it possible to obtain, from a DNA template, the complementary RNA. The cRNAs of the cDNAs of the mRNAs specific for the target gene (reference is then made to target-gene-specific cRNA) and the cRNAs of the cDNAs of the mRNAs not specific for the target gene are then obtained. 2) All the cRNAs are brought into contact with a substrate on which are immobilized capture probes specific for the target gene whose expression it is desired to analyze, in order to carry out a hybridization reaction between the target-gene-specific cRNAs and the capture probes, the cRNAs not specific for the target gene not hybridizing to the capture probes. When it is desired to simultaneously analyze the expression of several target genes, several different capture probes can be immobilized on the substrate, each one being specific for a target gene. The hybridization reaction may also be preceded by a step consisting in labeling and/or cleaving the target-gene-specific cRNAs as described above. 3) A step consisting of detection of the hybridization reaction is subsequently carried out. The detection can be carried out by bringing the substrate on which the capture probes specific for the target gene are hybridized with the target-gene-specific cRNA into contact with a “detection” probe labeled with a label, and detecting the signal emitted by the label. 
When the target-gene-specific cRNA has been labeled beforehand with a label, the signal emitted by the label is detected directly. The use of cRNA is particularly advantageous when a substrate of biochip type on which a large number of probes are hybridized is used.

The invention also relates to a substrate, comprising at least 4 hybridization probes selected from probes specific for the target genes with a nucleic sequence having any one of SEQ ID NOs 1 to 44 and in particular 4 hybridization probes specific for the target genes with a nucleic acid sequence having any one of SEQ ID NOs 1, 2 or 3, 4 and 5 or 6.

The invention further relates to the use of a substrate as defined above, for discriminating BC from BBD.

The present invention also concerns a kit for discriminating breast cancer from benign breast disease in a biological sample from a patient, the kit comprising at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 44, wherein the at least one reagent is specific for at least one target gene comprising a nucleic acid sequence selected from the nucleic acid sequences set forth in SEQ ID NOs: 1, 2 or 3, 4 and 5 or 6.

The specific reagents can target a combination of at least two, three or four genes, as described above in more detail, but no more than 28 genes. In one embodiment, the kit comprises reagents specific for a combination of at least 4 and no more than 28 target genes, wherein the reagents include at least reagents specific for the target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1, 2 or 3, 4 and 5 or 6, respectively. In another embodiment, the kit comprises reagents specific for a combination of 28 target genes, wherein the reagents include reagents specific for the target genes comprising the nucleic acid sequences set forth in SEQ ID NOs 1 to 28.

EXAMPLES

I) Materials and Methods

1. Characteristics of Patients and Samples

Blood samples were collected from 84 patients with breast cancer and 94 patients with benign breast disease in this study. All patients had been referred to the Breast Surgery Department of Cancer Hospital, Fudan University (Shanghai, China) with suspected breast cancer between July 2007 and December 2008. Each of them underwent mammographic screening in the hospital, and the BI-RADS category of every patient was determined by three professional radiologists. About 2.5 ml of peripheral blood was collected from each of the 84 women with BC and 94 women with BBD, in Paxgene™ Blood RNA tubes (PreAnalytix) containing an RNA stabilizing solution. All blood samples were collected before fine-needle aspiration or any other invasive procedure indicated for cytological investigation of the suspected breast lesion. Diagnosis of breast cancer was made on the basis of identification of cancer cells in the core-needle biopsy or surgical specimen. Diagnosis of benign disease was made on the basis of the absence of cancer cells at open biopsy. The protocol was approved by the local Ethical Committee for Clinical Research and written informed consent was obtained from all the patients recruited for the study. Final pathologic tumor stage was determined with the TNM staging system and tumors were graded using the Nottingham system. In addition to tumor type and tumor grade, estrogen receptor (ER), progesterone receptor (PR) and Human Epidermal growth factor Receptor 2 (HER2) status and lymph node status were assessed in each tumor.

2. RNA Extraction and Microarray Analysis

Total RNA was extracted with the PAXGene Blood RNA® kit (PreAnalytix) according to the manufacturer's instructions. The quantity of total RNA was measured by spectrophotometer at optical density (OD) 260 nm and the quality was assessed using the RNA 6000 Nano LabChip on a 2100 Bioanalyzer (Agilent Technologies). Only samples with an RNA Integrity Number (RIN) between 7 and 10 were analyzed. 50 ng of total RNA was then reverse transcribed and linearly amplified to single strand cDNA using Ribo-SPIA Ovation technology with the WT-Ovation RNA Amplification System (NuGen Technologies), according to the manufacturer's standard protocol, and the products were purified with the QIAquick PCR purification kit (Qiagen GmbH). 2 μg of amplified and purified cDNA was subsequently fragmented with RQ1 RNase-Free DNase (Promega corporation) and labeled with biotinylated deoxynucleoside triphosphates by Terminal Transferase (Roche Diagnostics GmbH) and DNA labeling reagent (Affymetrix). The labeled cDNA was hybridized onto the HG U133 plus 2.0 Array (Affymetrix) in a Hybridization Oven 640 (Affymetrix) at 60 rpm, 50° C. for 18 h. The HG U133 plus 2.0 Array contains 54,675 probe sets representing approximately 39,000 best characterized human genes. After hybridization, the arrays were washed and stained according to the Affymetrix protocol EukGE-WS2v4 using an Affymetrix fluidic station FS450. The arrays were scanned with the Affymetrix scanner 3000.

3. Microarray Data Analysis

Quality Control and Preprocessing. Quality control analyses were performed according to standard Affymetrix quality control parameters. Based on the evaluation criteria, all blood sample measurements fulfilled the minimal quality requirements. The Affymetrix expression arrays were preprocessed by RMA (Robust Multi-chip Average) [10] with background correction, quantile normalization and median polish summarization. Probesets with extreme signal intensity (lower than 50 or higher than 214) were filtered out. Then, sequence-information-based filtering was performed according to the Entrez Gene database information. Probesets without Entrez Gene ID annotation were removed. For multiple probesets mapping to the same Entrez Gene ID, only the probeset with the largest Interquartile Range was retained and the others were removed. Finally, to reduce batch effects, a normalization algorithm, ComBat [11], was applied. The ComBat method (statistics.byu.edu/johnson/ComBat/) applies either a parametric or a nonparametric empirical Bayes framework for adjusting batch effects in a given data set.

4. Molecular Signature Identification.

After pre-processing to reduce redundant probesets and batch variation across the expression data, molecular signature identification was performed on the preprocessed expression data. The 84 BC and 94 BBD samples with mammographic results and confirmed pathologic information were categorized into two groups: 79 BC+73 BBD with BI-RADS 1-5, and 5 BC+21 BBD with BI-RADS 0. The 79 BC+73 BBD samples with BI-RADS 1-5 were used as the training set to identify genes of interest by the Recursive Feature Elimination (RFE) procedure and to build the classification model by Support Vector Machine (SVM) [12-13]. Within the training set, a 5-fold cross-validation process was conducted to determine the optimal gene sets. A list of the top 100 genes was identified by RFE based on four fifths of the training set. The classification model was created based on these top 100 genes and was tested on the remaining fifth of the training set. This process was run for 1000 iterations, thus generating one thousand top-100 gene sets. Finally, the genes that appeared in all one thousand top-100 gene lists were identified as the most robust genes and were used to generate the final model on the whole training set. The model was then applied to the completely unseen samples, 5 BC+21 BBD with BI-RADS 0.
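The final selection rule (keep only the genes that appear in every top-N list across repeated resampling) can be sketched as below. This is a deliberately simplified illustration, not the procedure actually used: the ranking function is a stand-in based on the difference of class means, not SVM-based RFE; evaluation on the held-out fifth is omitted; and all names, the toy data and the parameter values are hypothetical.

```python
import random
from statistics import mean

def rank_genes(samples, labels, top_n):
    """Stand-in ranker (NOT SVM-RFE): absolute difference of class means."""
    scores = []
    for g in range(len(samples[0])):
        pos = [s[g] for s, y in zip(samples, labels) if y == 1]
        neg = [s[g] for s, y in zip(samples, labels) if y == 0]
        scores.append((abs(mean(pos) - mean(neg)), g))
    scores.sort(reverse=True)
    return {g for _, g in scores[:top_n]}

def robust_genes(samples, labels, top_n, iterations, seed=0):
    """Intersect the top-N gene sets obtained on repeated 4/5 subsamples."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    kept = None
    for _ in range(iterations):
        rng.shuffle(indices)
        held_out = set(indices[: len(indices) // 5])  # one fifth set aside
        train = [i for i in indices if i not in held_out]
        top = rank_genes([samples[i] for i in train],
                         [labels[i] for i in train], top_n)
        kept = top if kept is None else kept & top
    return kept or set()

# Toy data: 20 samples, 6 genes; only genes 0 and 1 differ between classes.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
samples = [[data_rng.gauss(0, 1) + (10 if g < 2 and y == 1 else 0)
            for g in range(6)] for y in labels]
selected = robust_genes(samples, labels, top_n=2, iterations=50)
```

Requiring presence in every list is a strict criterion: a gene that drops out of a single resampled list is discarded, which favors stability over list size.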

The preprocessing and statistical steps were executed using R-environment with Bioconductor libraries [14-18].

II) Results

1. Patient Characteristics

The present study was performed on 178 samples from 84 BC and 94 BBD patients with mammographic results and confirmed pathologic information, which were then categorized into two groups: 79 BC+73 BBD with BI-RADS 1-5, and 5 BC+21 BBD with BI-RADS 0. Table 2 summarizes the clinical characteristics of these BC and BBD patient populations. Briefly, 92% of the cancer patients presented a T0-T2 tumor; 70% and 32% of the tumors were hormone receptor positive and Her2 positive respectively. Benign findings included 51.1% breast disease, 27.7% breast fibroadenoma and 21.2% intracanalicular papilloma.

TABLE 2 Characteristics of the population

Benign Breast Disease (BBD): 94 patients

  Characteristic                          n       %
  Age (years): median 47.4, range 34-75
  Menopausal status
    Postmenopausal                        30      33.7%
    Premenopausal                         59      66.3%
    Not determined                        5
  Type of disease
    Breast disease                        48      51.1%
    Breast fibroadenoma                   26      27.7%
    Intracanalicular papilloma            20      21.2%

Breast cancer (BC): 84 patients

  Characteristic                          n       %
  Age (years): median 42.5, range 31-77
  Tumor type
    Ductal carcinoma in situ (DCIS)       11      13.1%
    Intra Ductal carcinoma (IDC)          73      86.9%
  Tumor size
    T1 (0.1-2 cm)                         44      52.4%
    T2 (>2-5 cm)                          34      40.5%
    T3 (>5 cm)                            1       1.2%
    Unknown                               5       5.9%
  Nodal status
    Positive                              25      29.8%
    Negative                              57      67.8%
    Unknown                               2       2.4%
  TNM stage
    0                                     10      11.9%
    I                                     28      33.3%
    II                                    33      39.3%
    III                                   11      13.1%
    Unknown                               2       2.4%
  Histological grade
    I                                     1       1.2%
    I-II                                  3       3.6%
    II                                    43      51.2%
    II-III                                8       9.5%
    III                                   18      21.4%
    Unknown                               11      13.1%
  Estrogen receptor status
    Negative                              19      22.6%
    Positive                              65      77.4%
  Progesterone receptor status
    Negative                              20      23.8%
    Positive                              64      76.2%
  Her-2 status
    Negative                              53      63.1%
    Positive                              31      36.9%

2. Construction and Performance of the Model

By using the Recursive Feature Elimination (RFE) procedure and Support Vector Machine (SVM) classification, a 28-gene panel (Table 1) was developed to discriminate between BC and BBD patients with BI-RADS 1-5. This 28-gene panel was then tested in the BI-RADS 0 group.

Among the 28 predictive genes, 15 are down-regulated in BC compared to BBD and 13 are up-regulated in BC versus BBD, as summarized in Table 3.

TABLE 3

  SEQ ID   Affymetrix      Abbreviated   Mean                   Fold     Expression in
  NOs:     probeset        name          signal   P-value       change   BC versus BBD
  1        209395_at       CHI3L1        271      5.74 × 10⁻³   1.22     Down-regulated
  2-3      1552552_s_at    CLEC4C        49       5.59 × 10⁻³   1.20     Down-regulated
  4        206881_s_at     LILRA3        73       4 × 10⁻⁶      1.43     Down-regulated
  5-6      204141_at       TUBB2A        684      5.82 × 10⁻²   1.30     Down-regulated
  7        213790_at       ADAM12        74       2.53 × 10⁻³   1.13     Up-regulated
  8        226736_at       CHURC1        124      5.54 × 10⁻⁴   1.26     Up-regulated
  9        230720_at       RNF182        49       3.52 × 10⁻³   1.58     Up-regulated
  10-13    220532_at       TMEM176B      97       1.70 × 10⁻²   1.21     Up-regulated
  14-15    219629_at       FAM118A       100      1.49 × 10⁻¹   1.12     Up-regulated
  16       156960_s_at     ANKRD20A      70       7.80 × 10⁻²   1.11     Down-regulated
  17-19    206785_s_at     KLRC1/2       93       4.87 × 10⁻²   1.15     Down-regulated
  20       225525_at       KIAA1671      69       1.75 × 10⁻²   1.12     Up-regulated
  21       1554469_at      ZBTB44        58       2.16 × 10⁻³   1.13     Down-regulated
  22-23    235126_at       LQK1          83       2.66 × 10⁻²   1.14     Up-regulated
  24-25    210873_x_at     APOBEC3A      335      3.52 × 10⁻¹   1.12     Down-regulated
  26       229187_at       LOC283788     94       1.91 × 10⁻¹   1.08     Up-regulated
  27       1559140_at      FAM87A/B      68       2.32 × 10⁻²   1.09     Up-regulated
  28       242770_at       LOC642236     49       2.35 × 10⁻²   1.14     Up-regulated
  29       214428_x_at     C4A/B         55       4.77 × 10⁻²   1.11     Down-regulated
  30       1554094_at      ENTPDS        87       4.70 × 10⁻⁵   1.11     Down-regulated
  31       215610_at       LOC728263     89       2.03 × 10⁻³   1.09     Up-regulated
  32       1553623_at      MGC15705      79       2.57 × 10⁻²   1.08     Down-regulated
  33-34    242687_at       FAM160A1      50       2.48 × 10⁻²   1.08     Up-regulated
  35       219700_at       PLXDC1        107      3.82 × 10⁻³   1.14     Down-regulated
  36       33323_r_at      SFN           54       1.26 × 10⁻¹   1.09     Down-regulated
  37-39    208791_at       CLU           112      2.37 × 10⁻¹   1.08     Up-regulated
  40-42    205048_s_at     PSPH          68       4.18 × 10⁻¹   1.06     Down-regulated
  43-44    212999_x_at     HLA-DQB1      120      1.00 × 10⁻¹   1.23     Down-regulated

4-Genes Signature

In the training set, the 4-gene panel CHI3L1, CLEC4C, LILRA3 and TUBB2A classified samples as malignant or benign with an estimated accuracy of 71% (76% sensitivity and 66% specificity).

Of the 79 breast cancer samples, 60 were classified correctly, while 48 of the 73 benign samples were assigned to the correct class (Table 4a).

TABLE 4a Classification value for the identified signature on Training Dataset

                                  Prediction outcome
  Training set                    BBD       BC
  Pathological diagnosis   BBD    48        25
                           BC     19        60

  Accuracy = 71%, Sensitivity = 76%, Specificity = 66%

The performance of the model in the independent BI-RADS 0 test set is reported in Table 4b. Three of the five cancer samples and 8 of the 21 benign samples were correctly classified, giving a sensitivity of 60% and a specificity of 38%. The accuracy of the model in the BI-RADS 0 test set is 42%.

TABLE 4b Classification value for the identified signature on Independent Test Dataset

                                  Prediction outcome
  Test set                        BBD       BC
  Pathological diagnosis   BBD    8         13
                           BC     2         3

  Accuracy = 42%, Sensitivity = 60%, Specificity = 38%
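The accuracy, sensitivity and specificity reported for the 4-gene signature follow directly from the counts in Tables 4a and 4b. The short check below uses the standard definitions, with BC as the positive class; the function name is hypothetical and the counts are taken from the two tables.

```python
def classification_metrics(tn: int, fp: int, fn: int, tp: int) -> dict[str, float]:
    """Standard metrics from a 2x2 confusion matrix (BC = positive class)."""
    total = tn + fp + fn + tp
    return {
        "accuracy": (tp + tn) / total,   # correctly classified / all samples
        "sensitivity": tp / (tp + fn),   # BC samples predicted as BC
        "specificity": tn / (tn + fp),   # BBD samples predicted as BBD
    }

# Counts read from Table 4a (training set) and Table 4b (BI-RADS 0 test set).
table_4a = classification_metrics(tn=48, fp=25, fn=19, tp=60)
table_4b = classification_metrics(tn=8, fp=13, fn=2, tp=3)

for name, metrics in (("4a", table_4a), ("4b", table_4b)):
    print(name, {k: f"{100 * v:.0f}%" for k, v in metrics.items()})
```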

28-Genes Signature

In the training set, the 28-gene panel classified samples as malignant or benign with an estimated accuracy of 88% (94% sensitivity and 84% specificity).

Of the 79 breast cancer samples, 74 were classified correctly, while 61 of the 73 benign samples were assigned to the correct class (Table 5a).

TABLE 5a Classification value for the identified signature on Training Dataset

                                  Prediction outcome
  Training set                    BBD       BC
  Pathological diagnosis   BBD    61        12
                           BC     5         74

  Accuracy = 88%, Sensitivity = 94%, Specificity = 84%

The performance of the model in the independent BI-RADS 0 test set is reported in Table 5b. Four of the five cancer samples and 15 of the 21 benign samples were correctly classified, giving a sensitivity of 80% and a specificity of 71%. The accuracy of the model in the BI-RADS 0 test set is 73%.

TABLE 5b Classification value for the identified signature on Independent Test Dataset

                                  Prediction outcome
  Test set                        BBD       BC
  Pathological diagnosis   BBD    15        6
                           BC     1         4

  Accuracy = 73%, Sensitivity = 80%, Specificity = 71%

The inventors also analyzed whether any of the clinical characteristics were significantly overrepresented among the subjects incorrectly predicted. They found that the only false negative case in the test set was a 46-year-old woman who had Paget's disease and DCIS.

BIBLIOGRAPHIC REFERENCES

1. Margaret M. Eberl, MPH, Chester H. Fox, Stephen B. Edge, Cathleen A. Carter, and Martin C. Mahoney. BI-RADS Classification for Management of Abnormal Mammograms, The Journal of the American Board of Family Medicine 19:161-1
2. Whitney A R, Diehn M, Popper S J, Alizadeh A A, Boldrick J C, Relman D A, Brown P O. Individuality and variation in gene expression patterns in human blood. Proc Natl Acad Sci USA. 2003, 18;100(4):1896-901.
3. P. E. Nielsen et al, Science, 254, 1497-1500 (1991).
4. Kricka et al., Clinical Chemistry, 1999, no 45 (4), p. 453-458.
5. Keller G. H. et al., DNA Probes, 2nd Ed., Stockton Press, 1993, sections 5 and 6, p. 173-249.
6. Tyagi & Kramer, Nature Biotech, 1996, 14:303-308.
7. M. Chee et al., Science, 1996, 274, 610-614.
8. A. Caviani Pease et al., Proc. Natl. Acad. Sci. USA, 1994, 91, 5022-5026.
9. G. Ramsay, Nature Biotechnology, 1998, No. 16, p. 40-44.
10. F. Ginot, Human Mutation, 1997, No. 10, p. 1-10.
11. J. Cheng et al, Molecular diagnosis, 1996, No. 1 (3), p. 183-200.
12. T. Livache et al, Nucleic Acids Research, 1994, No. 22 (15), p. 2915-2921.
13. J. Cheng et al, Nature Biotechnology, 1998, No. 16, p. 541-546.
14. Harris Drucker, Chris J. C. Burges, Linda Kaufman, Alex Smola and Vladimir Vapnik (1997). “Support Vector Regression Machines”. Advances in Neural Information Processing Systems 9, NIPS 1996, 155-161, MIT Press.
15. R Development Core Team (2009). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0, URL www.R-project.org
16. Gentleman R C, Carey V J, Bates D M, Bolstad B, Dettling M, Dudoit S, Ellis B, Gautier L, Ge Y, Gentry J, et al.: Bioconductor: open software development for computational biology and bioinformatics.
17. Crispin J Miller. simpleaffy (2009): Very simple high level analysis of Affymetrix data. R package version 2.22.0. www.bioconductor.org, bioinformatics.picr.man.ac.uk/simpleaffy/
18. R. Gentleman, V. Carey, W. Huber and F. Hahne (2009). genefilter: methods for filtering genes from microarray experiments. R package version 1.28.0.

1. A method for discriminating between breast cancer and benign breast disease in a biological sample from a patient, the method comprising the following steps:
a) obtaining the biological sample comprising a biological material from the patient,
b) contacting the biological material from the biological sample with at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the full-length nucleic acid sequences set forth in SEQ ID NOS: 1 to 44, wherein the at least one reagent is specific for at least one target gene comprising the full-length nucleic acid sequence set forth in SEQ ID NOS: 1 to 6,
c) measuring the expression level of the at least one target gene to obtain an expression profile for the patient, and
d) performing clustering analysis of the expression profile of the patient with expression profiles of the at least one target gene from patients previously clinically classified as having breast cancer and expression profiles of the at least one target gene from patients previously clinically classified as having benign breast disease, wherein:
if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as having breast cancer, then the patient is diagnosed to have breast cancer, and
if the expression profile of the patient is clustered with the expression profiles from patients previously clinically classified as having benign breast disease, then the patient is diagnosed to have a benign breast disease.
 2. The method as claimed in claim 1, wherein: in step b) the biological material from the biological sample is contacted with reagents specific for a combination of at least 4 and no more than 28 target genes, the at least four reagents being specific for at least four different target genes respectively comprising the full-length nucleic acid sequences set forth in: 1) SEQ ID NO: 1; and 2) SEQ ID NO: 2 or 3; and 3) SEQ ID NO: 4; and 4) SEQ ID NO: 5 or 6; and the expression level of the target genes is measured in step c) to obtain the expression profile for the patient.
 3. The method as claimed in claim 1, wherein in step b) the biological material is brought into contact with reagents specific for a combination of 28 target genes, and the expression level of the 28 genes is measured in step c) to obtain the expression profile for the patient.
 4. The method as claimed in claim 1, wherein the biological sample taken from the patient is a blood sample.
 5. The method as claimed in claim 1, wherein the biological material comprises nucleic acids.
 6. The method as claimed in claim 1, wherein the at least one specific reagent of step b) comprises at least one hybridization probe.
 7. The method as claimed in claim 6, wherein the specific reagents of step b) comprise at least one hybridization probe and at least one primer.
 8. The method as claimed in claim 7, wherein the specific reagents of step b) comprise one hybridization probe and two primers.
 9. A kit for discriminating breast cancer from benign breast disease in a biological sample from a patient, comprising at least one specific reagent for at least one target gene and no more than 28 specific reagents for 28 target genes comprising the full-length nucleic acid sequences set forth in SEQ ID NOS: 1 to 44, wherein the at least one reagent is specific for at least one target gene comprising the full-length nucleic acid sequence set forth in SEQ ID NOS: 1 to 6.
 10. The kit as claimed in claim 9, wherein the at least one specific reagent comprises at least four reagents respectively specific for at least four target genes and no more than 28 reagents, wherein the target genes are selected from the group consisting of genes comprising the full-length nucleic acid sequences set forth in SEQ ID NOS: 1 to 44, the at least four reagents being specific for at least four different target genes respectively comprising the full-length nucleic acid sequences set forth in: 1) SEQ ID NO: 1; and 2) SEQ ID NO: 2 or 3; and 3) SEQ ID NO: 4; and 4) SEQ ID NO: 5 or 6.
 11. The kit as claimed in claim 10, comprising reagents specific for a combination of 28 target genes.
 12. A method comprising manufacturing the kit of claim 9.
 13. A method comprising manufacturing the kit of claim 10.
 14. A method comprising manufacturing the kit of claim 11.
 15. The method as claimed in claim 12, wherein the at least one specific reagent comprises at least one hybridization probe.
 16. The method as claimed in claim 15, wherein the at least one specific reagent comprises at least one hybridization probe and at least one primer.
 17. The method as claimed in claim 16, wherein the specific reagent comprises one hybridization probe and two primers. 
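
For illustration only, the comparison recited in step d) of claim 1 can be sketched in a few lines of Python. The claims do not prescribe a particular clustering algorithm, so this sketch substitutes a simple nearest-centroid assignment with Euclidean distance: the patient's profile is grouped with whichever reference group has the closer mean profile. The function names (`centroid`, `euclidean`, `assign_group`), the four-gene layout, and all expression values are hypothetical and are not taken from the disclosure.

```python
from math import sqrt
from statistics import mean


def centroid(profiles):
    """Average each gene's expression level across a group of profiles."""
    return [mean(values) for values in zip(*profiles)]


def euclidean(a, b):
    """Euclidean distance between two expression profiles of equal length."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_group(patient, reference_groups):
    """Return the label of the reference group whose centroid is nearest.

    reference_groups maps a label to a list of expression profiles; every
    profile lists expression levels in the same gene order as the patient's.
    """
    centroids = {
        label: centroid(profiles)
        for label, profiles in reference_groups.items()
    }
    return min(centroids, key=lambda label: euclidean(patient, centroids[label]))


# Made-up expression levels for four hypothetical target genes (assumption,
# illustration only; not data from the disclosure).
reference_groups = {
    "breast cancer": [[8.1, 5.2, 7.9, 3.1], [7.8, 5.0, 8.3, 2.9]],
    "benign breast disease": [[5.0, 7.1, 4.2, 6.0], [5.3, 6.8, 4.5, 6.2]],
}
patient_profile = [7.9, 5.3, 8.0, 3.3]

print(assign_group(patient_profile, reference_groups))  # prints "breast cancer"
```

A hierarchical clustering or a support vector machine, as cited in the references, could replace the nearest-centroid rule without changing the overall flow: measure the profile, compare it with the two previously classified reference sets, and report the group it falls into.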